Many deep learning based beamformers have recently been proposed for multi-channel speech separation. Nevertheless, most of them rely on extra cues known in advance, such as speaker features, face images, or directional information. In this paper, we propose MIMO-DBnet, an end-to-end beamforming network for direction-guided speech separation given only the mixture signal. Specifically, we design a multi-channel input, multiple-output architecture that predicts direction-of-arrival based embeddings and beamforming weights for each source. The precisely estimated directional embeddings provide effective spatial discrimination guidance for the neural beamformer to offset the effect of phase wrapping, thus allowing more accurate reconstruction of the two sources' speech signals. Experiments show that the proposed MIMO-DBnet not only achieves a comprehensive and solid improvement over the baseline systems, but also maintains its performance on high-frequency bands when phase wrapping occurs.
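The phase wrapping mentioned in this abstract can be illustrated with a textbook two-microphone, far-field model. The sketch below is not taken from the paper: the function names, the 343 m/s speed of sound, and the 10 cm spacing in the usage note are all assumptions chosen for illustration. It shows that, above the frequency c / (2d), the observed inter-channel phase difference no longer equals the true one, which is why phase-based spatial cues become ambiguous on high-frequency bands.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not from the abstract


def phase_difference(freq_hz, doa_deg, spacing_m):
    """Return (true, observed) inter-channel phase difference in radians.

    Assumes a plane wave hitting two microphones `spacing_m` apart,
    with `doa_deg` measured from the array axis.
    """
    delay = spacing_m * math.cos(math.radians(doa_deg)) / SPEED_OF_SOUND
    true_phase = 2.0 * math.pi * freq_hz * delay
    # A microphone pair only observes the phase modulo 2*pi, i.e. in [-pi, pi].
    observed = math.atan2(math.sin(true_phase), math.cos(true_phase))
    return true_phase, observed


def is_wrapped(freq_hz, doa_deg, spacing_m):
    """True when the true phase difference falls outside [-pi, pi]."""
    true_phase, _ = phase_difference(freq_hz, doa_deg, spacing_m)
    return abs(true_phase) > math.pi
```

With a hypothetical 10 cm spacing, wrapping can start at 343 / (2 × 0.1) = 1715 Hz, so `is_wrapped(1000, 0, 0.1)` is false while `is_wrapped(4000, 0, 0.1)` is true.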
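Recent neural network based direction-of-arrival (DOA) estimation algorithms perform well in scenarios with an unknown number of sound sources. These algorithms usually map the multi-channel audio input to a single output, namely the overall spatial pseudo-spectrum (SPS) of all sources, and are therefore called MISO. However, such MISO algorithms depend heavily on an empirical threshold setting and on the assumption that the angle between any two sound sources is larger than a fixed angle. To address these limitations, we propose a novel multi-channel input, multiple-output DOA network called MIMO-DOANET. Unlike general MISO algorithms, MIMO-DOANET predicts the SPS coding of each sound source with the help of the informative spatial covariance matrix. As a result, the threshold-based task of detecting the number of sound sources becomes the easier task of detecting whether a sound source is present in each output, and the severe interaction between sound sources disappears at the inference stage. Experimental results show that, compared with the MISO baseline, MIMO-DOANET improves the F1 score by 18.6% relative (13.3% absolute) in the 3-source scenario and by 34.4% relative (20.2% absolute) in the 4-source scenario. The results also demonstrate that MIMO-DOANET alleviates the threshold-setting problem and effectively solves the angle-assumption problem.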
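The dual-encoder structure successfully uses two language-specific encoders (LSEs) for code-switching speech recognition. Because the LSEs are initialized from two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit abundant monolingual data and capture the attributes of each individual language. However, existing methods place no language constraints on the LSEs and underuse the language-specific knowledge of the LSMs. In this paper, we propose a language-specific characteristic assistance (LSCA) method to mitigate these problems. Specifically, during training, we introduce two language-specific losses as language constraints and generate the corresponding language-specific targets for them. During decoding, we take the decoding abilities of the LSMs into account by combining the output probabilities of the two LSMs and the mixture model to obtain the final predictions. Experiments show that either the training or the decoding method of LSCA alone improves the model's performance. Moreover, combining the training and decoding methods of LSCA gives the best result, with up to 15.4% relative error reduction on the code-switching test set. In addition, with our method, a system built on two pre-trained LSMs can handle code-switching speech recognition well without extra shared parameters or even retraining.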
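The decoding step described in this abstract combines the output probabilities of the two LSMs and the mixture model. The abstract does not say how they are combined, so the sketch below swaps in one common choice, a weighted log-linear interpolation followed by renormalization. The function name, the weights `w_a` and `w_b`, and the probability floor are all hypothetical.

```python
import math


def combine_token_probs(mixed, lsm_a, lsm_b, w_a=0.3, w_b=0.3, floor=1e-10):
    """Log-linearly interpolate three per-token distributions.

    `mixed`, `lsm_a`, `lsm_b` map token -> probability for one decoding step.
    Tokens missing from a language-specific model get the `floor` probability.
    The interpolation form and the weights are assumptions, not the paper's.
    """
    scores = {}
    for token, p_mixed in mixed.items():
        score = math.log(max(p_mixed, floor))
        score += w_a * math.log(max(lsm_a.get(token, floor), floor))
        score += w_b * math.log(max(lsm_b.get(token, floor), floor))
        scores[token] = score
    # Softmax over the combined scores (max-shifted for numerical stability).
    top = max(scores.values())
    norm = sum(math.exp(s - top) for s in scores.values())
    return {token: math.exp(s - top) / norm for token, s in scores.items()}
```

When the mixture model is undecided between two tokens, a confident language-specific model tips the combined distribution toward its preferred token, which is the intuition behind using the LSMs at decoding time.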
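Graph convolution based methods have been successfully applied to representation learning on homophily graphs, where nodes with the same label or similar attributes tend to connect with one another. Because of the homophily assumption of the graph convolutional networks (GCNs) these methods use, they are not suitable for heterophily graphs, where nodes with different labels or dissimilar attributes tend to be adjacent. Several methods have attempted to address this heterophily problem, but they do not change the fundamental aggregation mechanism of GCNs: they rely on summation operators to aggregate information from neighboring nodes, which implicitly follows the homophily assumption. Here, we introduce a novel aggregation mechanism and develop a random walk aggregation based graph neural network (called RAW-GNN). The proposed method integrates the random walk strategy with graph neural networks. It uses breadth-first random walk search to capture homophily information and depth-first search to collect heterophily information. It replaces the conventional neighborhoods with path-based neighborhoods and introduces a new path-based aggregator built on recurrent neural networks. These designs make RAW-GNN suitable for both homophily and heterophily graphs. Extensive experimental results show that the new method achieves state-of-the-art performance on a variety of homophily and heterophily graphs.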
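The breadth-first and depth-first random walks mentioned in this abstract can be sketched with a node2vec-style second-order bias. This is a stand-in technique: the abstract does not state how RAW-GNN biases its walks, and the function name and the parameters `p` and `q` below are borrowed from node2vec for illustration only. A large `q` keeps the walk near its previous node (breadth-first flavor), while a small `q` pushes it outward (depth-first flavor).

```python
import random


def biased_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Sample one walk of up to `length` nodes from `adj` (node -> set of neighbors).

    Unnormalized transition weights, given the previous node `prev`:
      1/p  to return to `prev`,
      1    to move to a neighbor of `prev`,
      1/q  to move farther away from `prev`.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(adj[current])
        if not neighbors:
            break  # dead end: stop early
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for node in neighbors:
            if node == prev:
                weights.append(1.0 / p)
            elif node in adj[prev]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk
```

Each sampled walk is an ordered path, so it could be fed to a sequence model such as a recurrent network, in the spirit of the path-based aggregator the abstract describes.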
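Sound source localization aims to find the direction of arrival (DOA) of all sound sources from the observed multi-channel audio. For the practical problem of an unknown number of sources, existing localization algorithms predict a likelihood-based coding (i.e., the spatial spectrum) and use a pre-determined threshold to detect the number of sources and the corresponding DOA values. However, these threshold-based algorithms are unstable because they depend on a carefully chosen threshold. To address this problem, we propose an iterative sound source localization approach called ISSL, which iteratively extracts each source's DOA without a threshold until a termination criterion is met. Unlike threshold-based algorithms, ISSL uses an active source detector network, based on a binary classifier, that takes the residual spatial spectrum as input and decides whether to stop the iteration. As a result, ISSL can handle an arbitrary number of sources, even more than the number seen during training. Experimental results show that ISSL achieves significant performance improvements in both DOA estimation and source number detection compared with existing threshold-based algorithms.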
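The iterative procedure described in this abstract can be sketched as a loop that extracts one DOA per pass and asks a detector whether any source remains in the residual spectrum. Everything below is an assumption made for illustration: the function name, the argmax-and-suppress extraction step, and the `suppress_width` parameter are hypothetical, and `has_active_source` is a placeholder callable standing in for the trained binary classifier the abstract mentions.

```python
def iterative_doa(spectrum, has_active_source, suppress_width=2, max_iters=None):
    """Extract DOA bin indices one at a time from a circular spatial spectrum.

    `spectrum` is a list of non-negative scores over an angle grid.
    `has_active_source(residual) -> bool` plays the role of the active source
    detector; here it is supplied by the caller rather than learned.
    """
    residual = list(spectrum)
    n_bins = len(residual)
    limit = n_bins if max_iters is None else max_iters
    doas = []
    while len(doas) < limit and has_active_source(residual):
        # Take the strongest remaining direction as the next source.
        peak = max(range(n_bins), key=residual.__getitem__)
        doas.append(peak)
        # Remove that source's contribution (crudely: zero a window around it).
        for offset in range(-suppress_width, suppress_width + 1):
            residual[(peak + offset) % n_bins] = 0.0
    return doas
```

For a quick check, a simple stand-in such as `lambda r: max(r) > 0.5` can be passed as the detector. Note that this stand-in is itself a threshold, so it only exercises the control flow of the loop; the point of the approach in the abstract is to replace it with a learned classifier.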
Despite significant progress in object categorization in recent years, a number of important challenges remain, mainly the ability to learn from limited labeled data and to recognize object classes within a large, potentially open, set of labels. Zero-shot learning is one way of addressing these challenges, but it has only been shown to work with limited-size class vocabularies and typically requires a separation between supervised and unsupervised classes, allowing the former to inform the latter but not vice versa. We propose the notion of vocabulary-informed learning to alleviate the above-mentioned challenges and to address the problems of supervised, zero-shot, generalized zero-shot, and open set recognition within a unified framework. Specifically, we propose a weighted maximum margin framework for semantic manifold-based recognition that incorporates distance constraints from (both supervised and unsupervised) vocabulary atoms. The distance constraints ensure that labeled samples are projected closer to their correct prototypes in the embedding space than to any others. We show that the resulting model improves supervised, zero-shot, generalized zero-shot, and large open set recognition, with a vocabulary of up to 310K classes, on the Animal with Attributes and ImageNet datasets.
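The distance constraints in this abstract require a labeled sample to lie closer to its correct prototype than to any other vocabulary atom. A minimal sketch of that idea as a per-atom hinge penalty follows. The function name, the squared Euclidean distance, and the unit margin are assumptions, and the weighting scheme of the paper's weighted maximum margin framework is not reproduced here.

```python
def margin_violations(embedding, prototypes, label, margin=1.0):
    """Return a hinge penalty per competing vocabulary atom.

    `prototypes` maps class name -> prototype vector; it may hold both
    supervised classes and unsupervised vocabulary atoms. A penalty of 0.0
    means the sample is closer to its own prototype by at least `margin`.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    d_correct = squared_distance(embedding, prototypes[label])
    return {
        name: max(0.0, margin + d_correct - squared_distance(embedding, proto))
        for name, proto in prototypes.items()
        if name != label
    }
```

Summing these penalties over samples gives a max-margin style objective: atoms that are already far away contribute nothing, while nearby atoms push the embedding toward the correct prototype.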
Deploying reliable deep learning techniques in interdisciplinary applications requires learned models to output accurate and (even more importantly) explainable predictions. Existing approaches typically explicate network outputs in a post-hoc fashion, under the implicit assumption that faithful explanations come from accurate predictions or classifications. We make the opposite claim: explanations boost (or even determine) classification. That is, end-to-end learning of explanation factors to augment discriminative representation extraction could be a more intuitive strategy to inversely assure fine-grained explainability, e.g., in neuroimaging and neuroscience studies with high-dimensional data containing noisy, redundant, and task-irrelevant information. In this paper, we propose such an explainable geometric deep network, dubbed NeuroExplainer, and apply it to uncover altered infant cortical development patterns associated with preterm birth. Given fundamental cortical attributes as network input, NeuroExplainer adopts a hierarchical attention-decoding framework to learn fine-grained attentions and the respective discriminative representations, so as to accurately distinguish preterm infants from term-born infants at term-equivalent age. NeuroExplainer learns the hierarchical attention-decoding modules under subject-level weak supervision coupled with targeted regularizers deduced from domain knowledge regarding brain development. These prior-guided constraints implicitly maximize the explainability metrics (i.e., fidelity, sparsity, and stability) during network training, driving the learned network to output detailed explanations and accurate classifications. Experimental results on the public dHCP benchmark suggest that NeuroExplainer leads to quantitatively reliable explanation results that are qualitatively consistent with representative neuroimaging studies.
As natural language processing (NLP) for gender bias becomes a significant interdisciplinary topic, the prevalent data-driven techniques, such as large-scale language models, suffer from data inadequacy and biased corpora, especially for languages with insufficient resources, such as Chinese. To this end, we propose a Chinese cOrpus foR Gender bIas Probing and Mitigation (CORGI-PM), which contains 32.9k sentences with high-quality labels derived by following an annotation scheme specifically developed for gender bias in the Chinese context. Moreover, we address three challenges for automatic textual gender bias mitigation, which require the models to detect, classify, and mitigate textual gender bias. We also conduct experiments with state-of-the-art language models to provide baselines. To the best of our knowledge, CORGI-PM is the first sentence-level Chinese corpus for gender bias probing and mitigation.
We present Second Thought, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain-of-edits between value-unaligned and value-aligned text, with LM fine-tuning and additional refinement through reinforcement learning, Second Thought not only achieves superior performance on three value alignment benchmark datasets but also shows strong human-value transfer learning ability in few-shot scenarios. The generated editing steps also offer better interpretability and make interactive error correction easier. Extensive human evaluations further confirm its effectiveness.
Medical image segmentation (MIS) is essential for supporting disease diagnosis and treatment effect assessment. Despite considerable advances in artificial intelligence (AI) for MIS, clinicians remain skeptical of its utility and have low confidence in such black-box systems, a problem exacerbated by poor generalization to out-of-distribution (OOD) data. To move towards effective clinical utilization, we propose a foundation model named EvidenceCap, which makes the box transparent in a quantifiable way through uncertainty estimation. EvidenceCap not only makes AI visible in regions of uncertainty and on OOD data, but also enhances the reliability, robustness, and computational efficiency of MIS. Uncertainty is modeled explicitly through subjective logic theory to gather strong evidence from features. We show the effectiveness of EvidenceCap on three segmentation datasets and apply it in the clinic. Our work sheds light on clinically safe applications and explainable AI, and can contribute towards trustworthiness in the medical domain.
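The subjective logic modeling of uncertainty mentioned in this abstract can be illustrated with the standard Dirichlet-based opinion used in evidential deep learning. The abstract does not state that EvidenceCap uses exactly this mapping, so the sketch is an assumption, as is the function name. For K classes with non-negative evidence e_k, the Dirichlet strength is S = sum(e_k) + K, the belief masses are b_k = e_k / S, and the uncertainty mass is u = K / S, so that the beliefs and the uncertainty sum to one.

```python
def opinion_from_evidence(evidence):
    """Map per-class evidence (e.g. for one pixel) to (beliefs, uncertainty).

    Uses the common subjective-logic / Dirichlet formulation with
    alpha_k = e_k + 1; whether EvidenceCap uses this exact form is assumed.
    """
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes  # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty
```

With no evidence at all, the uncertainty is 1.0. As evidence accumulates for any class, the uncertainty shrinks, which is what allows low-evidence regions such as OOD inputs to be flagged.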